arXiv:1506.01598v2 [cs.MS] 17 Jun 2015 


decimalinfinite: All Decimals In Bits. 
No Loss. Same Order. Simple. 


Ghislain Fourny 
28msec, Inc. 
Zurich, Switzerland 
g@28.io


ABSTRACT 

This paper introduces a binary encoding that supports arbitrarily large, small and precise decimals. It completely preserves information and order. It does not rely on any arbitrary use-case-based choice of calibration and is readily implementable and usable, as is. Finally, it is also simple to explain and understand.

1. INTRODUCTION 

In the early stages of computers, storage was scarce, and it was even scarcer in the processor itself. The first computers supported basic arithmetic operations on small integers (e.g., 4 bits). As more memory was needed, the size of the registers in the processor architecture was increased, finally reaching the current 64 bits.

While this was mostly driven by the size of the memory, which never failed to exceed the maximum value space at some point, this also had an impact on the size of integers and decimals processed (e.g., Von Neumann).

The numbers used in programs are of two main kinds: integers and decimals. Integers are typically bounded to some range. The case of decimals is a bit more complicated: they are limited both in their range (driven by the size of their exponent) and in their precision.

For scientists who need bigger numbers and/or more precision, special libraries are available, and domain-specific software makes it possible to overcome the limitations of processors.

In the modern database era, notably with document stores, syntaxes such as XML and JSON, but also their data models and the associated query languages, do not impose any limitation on integers or decimals: on the logical level, the entire value space is covered and the only limit is the size of the available storage.


In the context of document stores, decimal numbers need to be efficiently stored and retrieved, but they also need to be used as index values. Indexing documents on decimal values is the primary use case for decimalinfinite. For efficiency reasons, it is desirable that an index lookup can be done without decoding the decimals stored in a hash index (point queries) or a tree index (range queries). In the latter case, this is possible when the (lexicographic) order of encoded values corresponds to the ordering of the corresponding numbers. Then, only the queried decimals need to be encoded once, and the remaining comparisons occur on the binary sequences only.

When decimalinfinite was designed, to the best of our knowledge and according to our investigations, state-of-the-art databases did not yet support an encoding that would cover the entire decimal space (that is, the entire value space of the XML xs:decimal type, or the entire value space of JSON numbers). However, many encodings exist that have some of the desired properties. decimalinfinite unifies the ideas in these encodings in such a way that all these properties apply simultaneously. In particular, none of the separate schemes used are new: decimalinfinite is only original in the way it puts them all together.

This paper contributes an encoding that solves this problem. More recently, we became aware of the Rishe encoding (Section 2.7), which solves the same problem. However, decimalinfinite uses a different approach and does not require any calibration or choice of intervals. decimalinfinite also improves the asymptotic complexity for large decimals or integers.

The encoding and decoding algorithms have been implemented in C++ and are used on production machines to store decimals as BSON binaries on a MongoDB [12] data store. Another implementation is available in JSONiq [10]: the encoding takes only 117 lines of code.

The decimalinfinite encoding relies on decomposing the underlying decimal into a normal form, as is typically done in other encodings. This decomposition is made of an overall sign, an exponent sign, an exponent and a significand (also called mantissa by Von Neumann, although, as is common practice, we prefer to avoid this terminology because of its different meaning for logarithm computation). These four components are encoded in turn and in this order.


Section 2 gives an overview of the integer and decimal encoding landscape. Section 3 gives an overview of the two main ways of sorting sequences of bits. Section 4 gives the algorithm for encoding decimals in the decimalinfinite format. Section 5 gives the decoding algorithm. Section 6 gives a detailed proof of the main property of decimalinfinite (that it is order-preserving). Section 7 shows the encoding of the smallest integers. Section 8 shows how the core of the encoding can be extended to special numbers (NaN, infinity, etc.). Section 9 briefly summarizes complexity aspects, including a comparison to the Rishe encoding. Section 10 gives a few implementation details.

2. RELATED WORK 

There are various encodings of integers and decimals commonly used in practice and found in the literature. In general, one needs to distinguish between schemes for encoding digits and integers on the one hand (possibly used to encode decimals with fixed-point semantics), and schemes based on the former that support floating-point encodings on the other hand.

2.1 Natural encoding 

Positive integers have a natural encoding, which is simply their representation in base 2, as shown in Figure 1.

It supports an unlimited range, but it is not order-preserving.

If, however, the range is limited, for example to 8, 16, 32 or 64 bits, as is commonly done in most programming languages, and the encodings are padded with leading 0s, then this encoding becomes order-preserving.
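The effect of padding can be checked with a short Python sketch. It is illustrative only: the helper names `natural` and `padded` are not from the paper, and bit strings are compared with Python's built-in string ordering, which is lexicographic with a prefix sorting first.

```python
def natural(n: int) -> str:
    """Unpadded base-2 representation of a positive integer."""
    return format(n, "b")


def padded(n: int, width: int = 4) -> str:
    """Base-2 representation padded with leading 0s to a fixed width."""
    return format(n, "b").zfill(width)


numbers = [1, 2, 3, 4, 5]

# Sorting by the unpadded bit strings does not follow the numeric order,
# because "100" and "101" sort before "11".
by_natural = sorted(numbers, key=natural)

# Sorting by the padded bit strings follows the numeric order.
by_padded = sorted(numbers, key=padded)

print(by_natural)  # [1, 2, 4, 5, 3]
print(by_padded)   # [1, 2, 3, 4, 5]
```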

2.2 Signed Integers 

Signed integers are commonly stored by encoding the sign in the first bit (Figure 2), meaning that positive integers (beginning with a 0, which is half the range) are stored in the same way as the unsigned encoding, while negative integers are stored beginning with a 1. Lexicographic order is only preserved for positive integers, as well as for negative integers, but not overall.

2.3 Elias Gamma Code 

Gamma codes are a variable-length encoding that supports the entire non-negative integer range ($\mathbb{N}$). One of their main uses and motivations is that they are prefix codes, meaning that they can be concatenated to each other in such a way that they can still be separated again unambiguously. Figure 3 shows what the Gamma code looks like for the first integers.

The main idea is that a first sequence of 0s, terminated by a 1, encodes the length of the binary representation of the integer. Then, the natural binary representation of the integer (most of the time offset by 1), without its leading 1, follows.

Since the initial number of 0s is identical to the number of digits that follow after the 1, it is possible to unambiguously deduce where the encoding stops, solely relying on the number of leading 0s.


Figure 1: Natural encoding of an integer (base 2), not padded and padded to 4 bits.

Integer   Binary representation   Padded representation
1         1                       0001
2         10                      0010
3         11                      0011
4         100                     0100
5         101                     0101





Figure 2: Natural encoding of signed integers (base 2, padded to 4 bits).

Integer   Signed binary representation
-3        1101
-2        1110
-1        1111
0         0000
1         0001
2         0010
3         0011




Figure 3: The Gamma code for the first non-negative integers, and the modified, order-preserving Gamma code.

Integer   Offset by 1   Gamma     Modified Gamma
0         1 (1)         1         0
1         2 (10)        01 0      10 0
2         3 (11)        01 1      10 1
3         4 (100)       001 00    110 00
4         5 (101)       001 01    110 01
5         6 (110)       001 10    110 10




































For example, 00110010 can be unambiguously separated into 00110 (5) and 010 (1).

The Gamma code in its original form is not order-preserving. However, simply inverting the first sequence of 0s and the terminating 1 solves the issue, as is shown in the last column of Figure 3.
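The Gamma code and its modified variant can be sketched in a few lines of Python. This is an illustration under assumptions rather than code from the paper: the function names `gamma`, `modified_gamma` and `split_gamma` are hypothetical, and codes are handled as strings of "0" and "1" characters.

```python
def gamma(n: int) -> str:
    """Elias gamma code of a non-negative integer n (offset by 1)."""
    tail = format(n + 1, "b")[1:]        # binary of n + 1 without its leading 1
    return "0" * len(tail) + "1" + tail  # length prefix, terminating 1, tail


def modified_gamma(n: int) -> str:
    """Variant with the length prefix and the terminating bit inverted."""
    tail = format(n + 1, "b")[1:]
    return "1" * len(tail) + "0" + tail


def split_gamma(bits: str) -> list[int]:
    """Decode a concatenation of gamma codes back into integers."""
    values, i = [], 0
    while i < len(bits):
        zeros = 0
        while bits[i] == "0":            # count the leading 0s
            zeros += 1
            i += 1
        # bits[i] is the terminating 1; it doubles as the leading 1 of n + 1.
        values.append(int(bits[i:i + zeros + 1], 2) - 1)
        i += zeros + 1
    return values


print(split_gamma("00110010"))                  # [5, 1]
print(sorted(range(6), key=gamma))              # not in numeric order
print(sorted(range(6), key=modified_gamma))     # [0, 1, 2, 3, 4, 5]
```

Sorting by the modified code returns the integers in numeric order because a longer run of leading 1s always belongs to a larger integer, whereas a longer run of leading 0s in the original code sorts first.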

2.4 BCD, Chen-Ho, DPD 

Regardless of whether decimals are stored in fixed-point format or in floating-point format, the sequence of their significant digits needs to be encoded.

The binary-coded decimal (BCD) encoding encodes each digit in a group of 4 or 8 bits (groups of 4 bits are called tetrades). Improvements include the Chen-Ho encoding, which manages to encode 3 digits on 10 bits (declets). It has the nice particularity of being extremely efficient to process (no multiplications, no divisions), and of being friendly to decimal computations. However, it does not preserve order. The densely packed decimal (DPD) encoding [5] is an improvement upon the Chen-Ho encoding.

2.5 IEEE float and double encoding 

The IEEE 754 standard specifies several standard floating-point encodings, the most common of which are known as float and double in mainstream programming languages. Each of them supports only a finite range of numbers (its length is fixed) and is not order-preserving. The variants (binary16, binary32, binary64, binary128, decimal32, decimal64, decimal128) differ in their length and in the way the significand is encoded. The decimal variants rely on DPD.

2.6 IBM Patent 

The US Patent 7685214 ("Order-preserving encoding formats of floating-point decimal numbers for efficient value comparison"), filed by IBM, solves the order-preserving issue, but with a finite-length encoding, which implies that it supports a finite range of decimals only.

One very interesting idea in this approach is that, if the sign is negative, the significand $m$ is encoded as $10 - m$ to preserve the order.

2.7 The Matula-Kornerup and the Rishe Encodings

The Rishe encoding is the closest match to decimalinfinite that we found in the literature in terms of the problem solved.

It supports arbitrarily large, small and precise decimals, is compact, and is also compatible with a bitwise lexicographic comparison. However, it relies on an arbitrary choice of intervals (128) based on the use case. decimalinfinite does not rely on such a choice and scales up continuously, regardless of how large, small or precise decimals are.

The Rishe encoding was itself proposed as an improvement upon the Matula-Kornerup encoding, which relies on a representation of the decimal as a continued fraction. The latter did not scale up with exponents. The exponent part of Rishe scales up logarithmically in the decimal, while that of decimalinfinite scales up double-logarithmically (see Section 9).


3. ORDERING SEQUENCES OF BITS 

There are two widespread ways of sorting sequences of bits, as shown in Figure 4. The first one, pseudo-lexicographic (also called shortlex, quasi-lexicographic or length-lexicographic in the literature), first orders by size and then, within a size, lexicographically. The second one, regardless of the size, compares the bits from left to right, with the convention that, when a sequence is a prefix of another, the shorter one comes first.

MongoDB sorts binaries pseudo-lexicographically, even though we could find no documentation regarding this. decimalinfinite preserves the ordering of decimals under the full lexicographic order.
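The two orderings can be reproduced with the following Python sketch, which enumerates all bit sequences of length 1 to 3. Representing bit sequences as Python strings is an assumption made for illustration; Python's built-in string comparison happens to be the full lexicographic order described above.

```python
from itertools import product

# All bit sequences of length 1, 2 and 3.
seqs = ["".join(p) for n in (1, 2, 3) for p in product("01", repeat=n)]

# Pseudo-lexicographic (shortlex): by length first, then lexicographically.
shortlex = sorted(seqs, key=lambda s: (len(s), s))

# Full lexicographic: left to right, a prefix sorting before its extensions.
full_lex = sorted(seqs)

print(shortlex)
print(full_lex)
```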

4. ENCODING 

Let us now get into the details of the encoding itself. The general idea is that any non-zero decimal can be expressed in a canonical scientific form with four components (sign, exponent sign, exponent, significand). These four components can be encoded separately and concatenated. Since each of the components (but the last one) is a prefix code, the concatenation can be unambiguously decoded again.

4.1 Canonical decomposition 

Zero is handled separately and encoded as 10. Any non-zero decimal number can be expressed uniquely in scientific notation, as is commonly done in the literature, that is, in the form

$$s \times m \times 10^{t \times e}$$

where:

• The overall sign is $s \in \{-1, 1\}$.

• The exponent $e \in \mathbb{N}$ is a non-negative integer (obtained by rounding the base-ten logarithm of the absolute value of the original number down to an integer, and taking the absolute value of the result).

• The exponent sign is $t \in \{-1, 1\}$.

• The significand is $m \in [1, 10)$, a real number between 1 (included) and 10 (excluded).

If S denotes the encoding of s, T that of t and so on, then the overall encoding comes naturally as STEM, as shown in Figure 5. This is because decimal numbers in scientific notation can be sorted with the following criteria, in this order:

1. sign 

2. exponent sign 

3. exponent 

4. significand 

Throughout this paper, four examples, which cover various combinations of the four components, will be used:

$-103.2 = -1.032 \times 10^{2}$

$-0.0405 = -4.05 \times 10^{-2}$

$0.707106 = 7.07106 \times 10^{-1}$

$4005012345 = 4.005012345 \times 10^{9}$
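The canonical decomposition of these four examples can be reproduced with Python's `decimal` module. The sketch below is illustrative: the function name `decompose` and the choice to return the significand as a string of digits (the first digit being the one before the decimal point) are assumptions, not part of the paper.

```python
from decimal import Decimal


def decompose(text: str) -> tuple[int, int, int, str]:
    """Return (s, t, e, digits of m) for a non-zero finite decimal."""
    d = Decimal(text)
    if not d.is_finite() or d.is_zero():
        raise ValueError("zero and special numbers are handled separately")
    sign, digits, _ = d.as_tuple()
    s = -1 if sign else 1
    adjusted = d.adjusted()            # exponent of the most significant digit
    t = -1 if adjusted < 0 else 1      # an exponent of 0 counts as non-negative
    e = abs(adjusted)
    m = "".join(map(str, digits)).rstrip("0")  # drop trailing zeros
    return s, t, e, m


for example in ("-103.2", "-0.0405", "0.707106", "4005012345"):
    print(example, decompose(example))
```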

4.2 Encoding the sign 

The sign of a decimal is encoded on two bits, as shown in Figure 6.

Since zero is simply encoded as 10 with no further bits, it is already apparent that its encoding appears lexicographically after the encoding of any negative decimal, and before the encoding of any positive decimal.

The reason for using two bits rather than just one is that negative infinity (-INF), positive infinity (+INF) as well as negative zero and NaN can be conveniently encoded as well (see Section 8).


Figure 4: Two main ways of sorting binary sequences

Pseudo-lexicographic order   Full lexicographic order
0                            0
1                            00
00                           000
01                           001
10                           01
11                           010
000                          011
001                          1
010                          10
011                          100
100                          101
101                          11
110                          110
111                          111


So far, our four examples, together with zero, have an encoding that begins as follows:

$-1.032 \times 10^{2}$         00 ...
$-4.05 \times 10^{-2}$         00 ...
$0$                            10
$7.07106 \times 10^{-1}$       10 ...
$4.005012345 \times 10^{9}$    10 ...


4.3 Encoding the exponent sign 

The exponent sign is encoded on the third bit, as shown in Figure 7.

Figure 5: Encoding of an overall decimal in scientific notation $s \times m \times 10^{t \times e}$.

| S | T | E | M |

The encoding of our four examples continues as follows:

$-1.032 \times 10^{2}$         00 0 ...
$-4.05 \times 10^{-2}$         00 1 ...
$7.07106 \times 10^{-1}$       10 0 ...
$4.005012345 \times 10^{9}$    10 1 ...

Figure 6: Encoding of the overall decimal sign.

S    Sign s
00   negative sign ($s = -1$, e.g., $-4.3 \times 10^{9}$)
10   positive sign ($s = 1$, e.g., $4.3 \times 10^{9}$)

Figure 7: Encoding of the exponent sign.

S and T   s and t
00 0      negative sign, non-negative exponent sign
00 1      negative sign, negative exponent sign
10 0      positive sign, negative exponent sign
10 1      positive sign, non-negative exponent sign

4.4 Encoding the exponent

The absolute value of the exponent is encoded with a modified gamma code (as explained in Section 2.3), using an offset of 2 (footnote 1):

1. The exponent is offset by +2; for example, 4 is encoded with the modified gamma code of 6.

2. The offset exponent is written in binary form; for example, 6 is written 110.

3. Call N the number of its digits (in the case of 110: 3).

4. The first digit is replaced with N-1 ones, followed by a zero (in the case of 110: 110 10).

Footnote 1: The offset by 2 is needed because the initial, length-discriminating sequence of the Gamma code must occupy at least two bits. With only one bit, it would be impossible to deduce both the sign of the exponent and the length of the exponent encoding.

Figure 8 shows how the smallest absolute values of the exponent are encoded.

Figure 8: Encoding the exponent.

e   e offset by 2   TE (not negated, T = 1)   TE (negated, T = 0)
0   2 (10)          10 0                      01 1
1   3 (11)          10 1                      01 0
2   4 (100)         110 00                    001 11
3   5 (101)         110 01                    001 10
4   6 (110)         110 10                    001 01
5   7 (111)         110 11                    001 00
6   8 (1000)        1110 000                  0001 111
7   9 (1001)        1110 001                  0001 110
8   10 (1010)       1110 010                  0001 101
9   11 (1011)       1110 011                  0001 100

Once the absolute value of the exponent has been encoded as shown above, it is either negated (all of its bits are inverted) if T = 0, or left unchanged if T = 1. E is then obtained by dropping the first bit of the obtained string (because this first bit is already encoded in T). In other words, the (negated or not) absolute value of the exponent is encoded as TE.

Note that 0 is treated as a non-negative exponent, so that it will always be encoded as 100 if the decimal is positive, and as 011 if the decimal is negative. This means that a decimalinfinite encoding will never begin with 10011 or 00100 (footnote 2).

The encoding of our four examples continues as follows:

$-1.032 \times 10^{2}$ (e = 2, opposite signs)        00 001 11 ...
$-4.05 \times 10^{-2}$ (e = 2, same sign)             00 110 00 ...
$7.07106 \times 10^{-1}$ (e = 1, opposite signs)      10 01 0 ...
$4.005012345 \times 10^{9}$ (e = 9, same sign)        10 1110 011 ...

4.5 Encoding the significand

The significand is encoded in a way similar to decimal32, decimal64 and decimal128, that is:

• its initial digit (before the decimal point) is encoded on 4 bits (tetrade) in its natural binary representation.

• the remaining digits (after the decimal point) are organized in groups of 3 (declets). Each declet is encoded in its natural binary representation on 10 bits. Trailing 0s are added to make sure that the last group also has 3 digits.

Figure 9: Examples of significand encodings

Encoded value           Digit groups    Tetrade   Declets
8.968 (= 10 - 1.032)    8 968           1000      1111001000
5.95 (= 10 - 4.05)      5 950           0101      1110110110
7.07106                 7 071 060       0111      0001000111 0000111100
4.005012345             4 005 012 345   0100      0000000101 0000001100 0101011001

The encoding of our four examples can now be completed:

$-1.032 \times 10^{2}$ ($10 - m$ is taken)
00 001 11 1000 1111001000

$-4.05 \times 10^{-2}$ ($10 - m$ is taken)
00 110 00 0101 1110110110

$7.07106 \times 10^{-1}$
10 01 0 0111 0001000111 0000111100

$4.005012345 \times 10^{9}$
10 1110 011 0100 0000000101 0000001100 0101011001
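A compact Python sketch of the complete encoder, assembled from Sections 4.1 to 4.5, may help to follow the four examples. It is not the C++ or JSONiq implementation mentioned in this paper: the function name `encode`, the use of Python's `decimal` module for the canonical decomposition, and the choice to keep the padding of the last declet are assumptions made for illustration. Special numbers are not handled, and for negative decimals the sketch encodes 10 - m, as annotated in the examples above.

```python
from decimal import Decimal

FLIP = str.maketrans("01", "10")


def encode(text: str) -> str:
    """decimalinfinite bit string of a finite decimal given as text."""
    d = Decimal(text)
    if d.is_zero():
        return "10"
    sign, digits, _ = d.as_tuple()
    negative = bool(sign)
    adjusted = d.adjusted()          # scientific exponent, possibly negative
    e = abs(adjusted)

    # S: overall sign on two bits.
    S = "00" if negative else "10"

    # TE: modified gamma code of e + 2, with all bits inverted when T = 0.
    tail = format(e + 2, "b")[1:]
    TE = "1" * len(tail) + "0" + tail
    if negative == (adjusted >= 0):  # T = 0 exactly in these two cases
        TE = TE.translate(FLIP)

    # M: one tetrade, then one 10-bit declet per group of 3 digits.
    m = "".join(map(str, digits)).rstrip("0")
    frac = m[1:]
    frac += "0" * (-len(frac) % 3)   # pad the last group to 3 digits
    width = 1 + len(frac)
    value = int(m[0] + frac)
    if negative:
        value = 10 ** width - value  # encode 10 - m instead of m
    all_digits = str(value).zfill(width)
    M = format(int(all_digits[0]), "04b")
    for i in range(1, width, 3):
        M += format(int(all_digits[i:i + 3]), "010b")
    return S + TE + M


print(encode("-103.2"))  # 000011110001111001000
print(encode("1"))       # 101000001
```

The condition `negative == (adjusted >= 0)` restates the two rows with T = 0 in Figure 7: a negative decimal with a non-negative exponent, or a positive decimal with a negative exponent.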

If the overall sign of the decimal is negative, a trick similar to that of the IBM patent is used: $10 - m$ is encoded instead of $m$ (in this case, the leading digit may be a 0).

Footnote 2 (on the excluded prefixes 10011 and 00100): This could have been avoided, but at the cost of offsetting negative exponent encodings (and only them) by 1 instead of 2, which would have introduced more complexity.

5. DECODING

Decoding is also performed from left to right, in a way similar to encoding.

5.1 Decoding the overall sign

The overall decimal sign is obtained straightforwardly from the first two bits. If no more bits follow, it is a zero. Otherwise, decoding continues with the exponent.

5.2 Decoding the exponent

The exponent sign can be deduced from the third bit, but depends on the overall sign:

• If the overall sign is - and the third bit is a 0, the exponent sign is +.

• If the overall sign is - and the third bit is a 1, the exponent sign is -.

• If the overall sign is + and the third bit is a 0, the exponent sign is -.

• If the overall sign is + and the third bit is a 1, the exponent sign is +.

The exponent encoding, starting at the third bit, is of variable length. Since gamma codes are prefix codes, though, determining the length of the exponent encoding is straightforward.

One starts at the third bit and, including it, counts the number of identical bits that follow. If there is a sequence of N identical bits (whether 0s or 1s) starting from the third bit, then one can deduce that the exponent was encoded on 2N+1 bits.

An example best illustrates this:

1011100110100000000010100000011000101011001

Starting from the third bit, there is a sequence of three 1s, so the exponent is on 7 bits:

10 1110011 0100000000010100000011000101011001

The next step is to flip all the bits in the exponent encoding if the leading bit is a 0. In the example, no change is needed.

The exponent is then decoded as a modified gamma code (Section 2.3) and offset by -2:

1. The first N+1 bits are replaced with a 1 (in the example: 1011).

2. The obtained bit sequence is decoded as a natural binary representation (11).

3. One subtracts 2 (example: 9).

If the obtained exponent is 0, but the exponent sign was encoded as negative (i.e., the overall decimalinfinite encoding begins with 10011 or 00100), an error is raised.

5.3 Decoding the significand 

The significand is decoded in groups of 10 bits (except the first group, which has 4 bits). Each group is decoded as a natural binary representation. The first group gives the digit before the decimal point; the other groups give the digits (three per group) after the decimal point.

If the first group does not yield a number between 0 and 9, or a subsequent group does not yield a number between 0 and 999, an error is raised.

Finally, if the overall sign is negative, the complement to 10 is taken instead. An error is raised if the result is not between 1 (included) and 10 (excluded).

5.4 Summary of decoding errors 

The following errors can be raised upon decoding an invalid sequence:

• the sequence begins with 01 or 11, but does not correspond to -INF, +INF or NaN (Section 8)

• 0 was encoded as a negative exponent (an encoded sequence cannot begin with 10011 or 00100)

• a tetrade or a declet is outside of the [0,9] or [000,999] range.

• the overall significand, after possibly taking the complement to 10, is outside of the [1, 10) range, that is:

  - the encoded tetrade is 0 for a positive decimal, or

  - there are non-zero encoded declets after a 9 for a negative decimal.
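The decoding steps and error conditions of this section can be sketched in Python as follows. This is a minimal illustration under assumptions, not the implementation described in Section 10: the name `decode` is hypothetical, special numbers are rejected rather than decoded, and encodings whose trailing zeros were removed (Section 8.2) are not accepted.

```python
from decimal import Decimal

FLIP = str.maketrans("01", "10")


def decode(bits: str) -> Decimal:
    """Decode a decimalinfinite bit string holding zero or a finite decimal."""
    S, rest = bits[:2], bits[2:]
    if S not in ("00", "10") or (S == "00" and not rest):
        raise ValueError("special numbers are outside the scope of this sketch")
    if not rest:
        return Decimal(0)
    negative = S == "00"

    # N identical bits from the third bit on: the exponent uses 2N + 1 bits.
    n = len(rest) - len(rest.lstrip(rest[0]))
    TE, M = rest[:2 * n + 1], rest[2 * n + 1:]
    if len(TE) != 2 * n + 1 or len(M) < 4 or (len(M) - 4) % 10:
        raise ValueError("truncated encoding")
    inverted = TE[0] == "0"
    if inverted:
        TE = TE.translate(FLIP)
    e = int("1" + TE[n + 1:], 2) - 2          # first N + 1 bits replaced with a 1
    exponent_negative = inverted != negative  # the four cases of Section 5.2
    if e == 0 and exponent_negative:
        raise ValueError("0 was encoded as a negative exponent")

    # One tetrade, then declets, each in natural binary.
    groups = [int(M[:4], 2)] + [int(M[i:i + 10], 2) for i in range(4, len(M), 10)]
    if groups[0] > 9 or max(groups) > 999:
        raise ValueError("tetrade or declet out of range")
    width = 1 + 3 * (len(groups) - 1)
    value = int(str(groups[0]) + "".join(f"{g:03d}" for g in groups[1:]))
    if negative:
        value = 10 ** width - value           # complement to 10
    if not 10 ** (width - 1) <= value < 10 ** width:
        raise ValueError("significand outside [1, 10)")
    scale = (-e if exponent_negative else e) - (width - 1)
    return Decimal((int(negative), tuple(map(int, str(value))), scale))


print(decode("1011100110100000000010100000011000101011001"))  # 4005012345
```

The result is built from a (sign, digits, exponent) tuple rather than by arithmetic on `Decimal` values, so that no rounding to the context precision can occur for long significands.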

6. WHY IT IS ORDER-PRESERVING 

The encoding is designed in such a way that, if $a < b$ (we treat zero separately), then the encoded $a$ ($STEM_a$) comes lexicographically before ($\ll$) the encoded $b$ ($STEM_b$).

A proof thereof now follows.

We index s, t, e, m, S, T, E and M with a and b, that is, a's (absolute) exponent is called $e_a$, b's exponent is called $e_b$, a's significand is called $m_a$, b's significand is called $m_b$, and so on.

1. If a is negative and b is positive, then $S_a = 00$ and $S_b = 10$, so that $STEM_a \ll STEM_b$.

2. If a and b are both positive, then $S_a = S_b = 10$.

(a) If a's exponent is negative and b's exponent is non-negative, the next digit in $STEM_a$ ($T_a$) will be a 0 and that of $STEM_b$ ($T_b$) a 1, so that $STEM_a \ll STEM_b$.

Otherwise the exponents have the same sign.

(b) If a's exponent and b's exponent are both non-negative ($T_a = T_b = 1$), and $e_a + 2$ has fewer digits than $e_b + 2$, then $TE_a$ will have fewer 1s than $TE_b$ in front of the next 0, so that $STEM_a \ll STEM_b$.

(c) If a's exponent and b's exponent are both negative ($T_a = T_b = 0$), and $e_a + 2$ has more digits than $e_b + 2$, then $TE_a$ will have more 0s than $TE_b$ in front of the next 1, so that $STEM_a \ll STEM_b$.

Otherwise the exponents have the same sign and their offsets by 2 have the same number of bits.

(d) If a's exponent and b's exponent are both non-negative ($T_a = T_b = 1$) but different ($e_a < e_b$) and $e_a + 2$ has as many digits ($N$) as $e_b + 2$, then $TE_a$ and $TE_b$ will both have $N - 1$ 1s followed by a 0. The next $N - 1$ digits after the 0 in $TE_a$ and $TE_b$ correspond to a natural binary representation (with no leading 1) of $e_a + 2$ and $e_b + 2$ respectively, so that $STEM_a \ll STEM_b$ because the natural binary representations preserve order given a fixed number of digits.

(e) If a's exponent and b's exponent are both negative ($T_a = T_b = 0$) but different ($e_a > e_b$) and $e_a + 2$ has as many digits ($N$) as $e_b + 2$, then $TE_a$ and $TE_b$ will both have $N - 1$ 0s followed by a 1. The next $N - 1$ digits after the 1 in $TE_a$ and $TE_b$ correspond to an inverted natural binary representation, with no leading 0, of $e_a + 2$ and $e_b + 2$ respectively. Since $e_a > e_b$, $E_a \ll E_b$ because it is inverted, and $STEM_a \ll STEM_b$.

Otherwise the exponents are identical.

(f) If a's exponent and b's exponent are equal, then $m_a < m_b$ and $TE_a = TE_b$. $M_a$ and $M_b$, organized in one group of 4, then groups of 10, are all natural binary representations of the symbols of $m_a$ and $m_b$ in base 1000, and preserve the order, so that $M_a \ll M_b$ and $STEM_a \ll STEM_b$.

3. If a and b are both negative, then $S_a = S_b = 00$.

(a) If a's exponent is non-negative and b's exponent is negative, the next digit in $STEM_a$ will be a 0 and that of $STEM_b$ a 1, so that $STEM_a \ll STEM_b$.

Otherwise the exponents have the same sign.

(b) If a's exponent and b's exponent are both non-negative ($T_a = T_b = 0$), and $e_a + 2$ has more digits than $e_b + 2$, then $TE_a$ will have more 0s than $TE_b$ in front of the next 1, so that $STEM_a \ll STEM_b$.

(c) If a's exponent and b's exponent are both negative ($T_a = T_b = 1$), and $e_a + 2$ has fewer digits than $e_b + 2$, then $TE_a$ will have fewer 1s than $TE_b$ in front of the next 0, so that $STEM_a \ll STEM_b$.

Otherwise the exponents have the same sign and their offsets by 2 have the same number of bits.

(d) If a's exponent and b's exponent are both negative ($T_a = T_b = 1$) but different ($e_a < e_b$) and $e_a + 2$ has as many digits ($N$) as $e_b + 2$, then $TE_a$ and $TE_b$ will both have $N - 1$ 1s followed by a 0. The next $N - 1$ digits after the 0 in $TE_a$ and $TE_b$ correspond to a natural binary representation (with no leading 1) of $e_a + 2$ and $e_b + 2$ respectively, so that $STEM_a \ll STEM_b$ because the natural binary representations preserve order given a fixed number of digits.

(e) If a's exponent and b's exponent are both non-negative ($T_a = T_b = 0$) but different ($e_a > e_b$) and $e_a + 2$ has as many digits ($N$) as $e_b + 2$, then $TE_a$ and $TE_b$ will both have $N - 1$ 0s followed by a 1. The next $N - 1$ digits after the 1 in $TE_a$ and $TE_b$ correspond to an inverted natural binary representation, with no leading 0, of $e_a + 2$ and $e_b + 2$ respectively. Since $e_a > e_b$, $E_a \ll E_b$ because it is inverted, and $STEM_a \ll STEM_b$.

Otherwise the exponents are identical.

(f) If a's exponent and b's exponent are equal, then $m_a > m_b$ and $TE_a = TE_b$. $M_a$ and $M_b$, organized in one group of 4, then groups of 10, are all natural binary representations of the symbols of $10 - m_a$ and $10 - m_b$ in base 1000, and preserve the order. Since $10 - m_a < 10 - m_b$, $M_a \ll M_b$ and $STEM_a \ll STEM_b$.

4. 0 is encoded as 10, and 10 is lexicographically greater than the encodings of negative decimals, which begin with 00. It is lexicographically smaller than the encodings of positive decimals, which begin with 10 followed by at least one further digit.
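As a sanity check of this property (an illustration, not a substitute for the proof), the Python sketch below sorts the paper's four running examples and zero by their encodings and compares the result with the numeric order. The bit strings are hard-coded from Section 4, with the second declet of 7.07106 written as the 10-bit binary representation of 060; the variable names are illustrative.

```python
from decimal import Decimal

# (value, encoding) pairs, deliberately listed out of order.
examples = [
    (Decimal("0.707106"), "10 010 0111 0001000111 0000111100"),
    (Decimal("-0.0405"), "00 11000 0101 1110110110"),
    (Decimal("4005012345"), "10 1110011 0100 0000000101 0000001100 0101011001"),
    (Decimal("0"), "10"),
    (Decimal("-103.2"), "00 00111 1000 1111001000"),
]

# Python compares strings in full lexicographic order (a prefix sorts first).
by_bits = sorted(examples, key=lambda pair: pair[1].replace(" ", ""))
by_value = sorted(examples, key=lambda pair: pair[0])

print([str(value) for value, _ in by_bits])
print(by_bits == by_value)  # True
```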

7. EXAMPLES 

Figure 10 shows the decimalinfinite encoding for the smallest integers. Integers with an absolute value between 1 and 10 are encoded on 9 bits, and until 100 on 19 bits (less if trailing zeros are removed, as shown in Section 8.2).

8. FINE-TUNING THE SCHEME 

8.1 Special numbers 

Special numbers such as positive and negative infinity and negative zero can also be encoded in such a way that the lexicographic order still holds, as shown in Figure 11. NaN can also be encoded (even if the order does not apply in this case).

Figure 10: The encoding of the smallest integers (the space separators are only here for easier reading).

Decimal   decimalinfinite encoding
-15       00 010 1000 0111110100
-14       00 010 1000 1001011000
-13       00 010 1000 1010111100
-12       00 010 1000 1100100000
-11       00 010 1000 1110000100
-10       00 010 1001
-9        00 011 0001
-8        00 011 0010
-7        00 011 0011
-6        00 011 0100
-5        00 011 0101
-4        00 011 0110
-3        00 011 0111
-2        00 011 1000
-1        00 011 1001
0         10
1         10 100 0001
2         10 100 0010
3         10 100 0011
4         10 100 0100
5         10 100 0101
6         10 100 0110
7         10 100 0111
8         10 100 1000
9         10 100 1001
10        10 101 0001
11        10 101 0001 0001100100
12        10 101 0001 0011001000
13        10 101 0001 0100101100
14        10 101 0001 0110010000
15        10 101 0001 0111110100

8.2 Trailing zeros 

To save space, trailing zeros can be removed from the binary 
encoding and added back while decoding (to fit the size of 
the last declet group). 

8.3 Fixed-length variant

In environments where an ordering that preserves the lexicographic order across different lengths is not supported (this is the case with MongoDB's ordering of binaries), this encoding can be adapted to work at the cost of limiting the range. A prefix of the binary encoding can be taken as an approximation of the encoded decimal, padded with trailing 0s if too short. This works as long as the total stored length exceeds the length of the encoding of the sign and exponent, which limits the range.
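The following Python sketch illustrates one reading of this variant, in which every encoding is truncated or padded on the right with 0s to a common width. The helper name `to_fixed` and the width of 24 bits are assumptions chosen for illustration; the encodings are taken from Figure 10.

```python
def to_fixed(bits: str, width: int = 24) -> str:
    """Truncate or right-pad an encoding to exactly `width` bits."""
    return bits[:width].ljust(width, "0")


figure_10 = {
    15: "10 101 0001 0111110100",
    -1: "00 011 1001",
    0: "10",
    10: "10 101 0001",
    -15: "00 010 1000 0111110100",
    1: "10 100 0001",
}

fixed = {n: to_fixed(b.replace(" ", "")) for n, b in figure_10.items()}

# All stored values have the same length, so pseudo-lexicographic and full
# lexicographic comparison agree on them.
ordered = sorted(fixed, key=fixed.get)
print(ordered)  # [-15, -1, 0, 1, 10, 15]
```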

8.4 Prefix code variant 

As suggested by Nathan Hurst, decimalinfinite can be turned into a prefix code (self-delimiting). This can be achieved by adding a bit after each tetrade and declet in the significand encoding, to indicate whether further bits follow (1) or not (0), as is done in the Rishe encoding.

Figure 11: Adding special numbers.

Encoding   Meaning
00         -INF
00...      negative sign (e.g., $-4.3 \times 10^{9}$)
01         negative zero
10         positive zero
10...      positive sign (e.g., $4.3 \times 10^{9}$)
11         +INF
111        NaN

9. COMPLEXITY 

The storage space is linear in the size (number of digits) of the significand, and logarithmic in the exponent, that is, double-logarithmic in the decimal.

The encoding size of a decimal can be computed with:

$$7 + 2\lfloor \log_2(e + 2) \rfloor + \frac{10}{3}(|m| - 1)$$

where $|m|$ denotes the number of decimal digits in the significand (the constant accounts for the 2 bits of S, the bit of T and the 4 bits of the tetrade). Asymptotically, it can be approximated further with

$$2\log_2 e + \frac{10}{3}|m|$$

that is

$$O(\log e + |m|)$$
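The size can also be counted component by component, as in the Python sketch below. It assumes that trailing zeros are kept, so that the last declet is always complete; the function name `encoded_size` is illustrative. The values printed can be compared with the lengths of the encodings shown in Sections 4 and 7.

```python
def encoded_size(e: int, digits: int) -> int:
    """Bit length of an encoding with absolute exponent e and a significand
    of `digits` decimal digits (trailing zeros kept)."""
    sign_bits = 2                              # S
    n = (e + 2).bit_length()                   # binary digits of e + 2
    exponent_bits = 2 * n - 1                  # T and E together
    tetrade_bits = 4                           # digit before the decimal point
    declets = -(-(digits - 1) // 3)            # ceil((digits - 1) / 3)
    return sign_bits + exponent_bits + tetrade_bits + 10 * declets


print(encoded_size(0, 1))   # 9, e.g. the integer 1
print(encoded_size(1, 2))   # 19, e.g. the integer 11
print(encoded_size(9, 10))  # 43, e.g. 4005012345
```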

The encoding of the significand uses a common approach that is very compact in terms of entropy, and making it more compact (for example, by grouping bits in bigger groups) would increase computational complexity. The factor of $\frac{10}{3}$ is slightly better than the factor of 4 of the Rishe encoding, where 8 bits are on average necessary for each additional two significant digits. For the sake of a fair comparison, note that the prefix-code version (Section 8.4) of decimalinfinite would increase the factor to $\frac{11}{3}$. Also, the Rishe encoding of very small integers is, by design, more compact.

The encoding of the exponent deviates from an optimal size by a constant factor of 2, which is the cost of using the Gamma prefix code. For large decimals $d$, the Rishe encoding contributes a factor of $0.8 \log_{10} d$ (logarithmic) to the exponent part (semi-progressive intervals require 8 bits for an increase of 10 in the exponent), whereas decimalinfinite contributes only $2 \log_2 \log_{10} d$ (double-logarithmic).

If only non-negative integers ($i$) are considered, then

$$|m| = \log_{10} i$$

is asymptotically more significant than

$$\log_2 e = \log_2 \lfloor \log_{10} i \rfloor$$

so that the encoding size grows logarithmically in $i$, which is an optimal complexity (see Figure 12).


Figure 12: Asymptotic size of the encoding for integers (obtained experimentally with the JSONiq implementation). The blue lines show the fixed-length natural binary representation on 32, 64 and 128 bits on the range they cover. [Plot not reproduced; horizontal axis: integer (logarithmic scale).]

This logarithmic complexity is identical to that of the Rishe 
encoding, except that the constant is slightly lower (3.3 vs. 
4.8). 

10. IMPLEMENTATION 

The encoding and decoding algorithms were successfully implemented in C++, and are used to store JSON numbers losslessly in BSON binaries on an underlying MongoDB layer. The decimals are stored with a special library supporting big numbers; however, for efficiency reasons, we do use classical limited-range integers when dealing with exponents or the number of digits of the decimal to encode. This is not a limitation of the decimalinfinite algorithm, but rather a pragmatic decision: it is unlikely that decimals whose order of magnitude or number of digits exceeds the range of such integers would be encountered in a real-life setting.

Unfortunately, MongoDB does not use a full lexicographic ordering of binaries (rather, a pseudo-lexicographic one), such that padding to a fixed length was needed for this vendor.

The code has been running with no known issues on production servers since 2012.

Also, an implementation in JSONiq is available on GitHub [10] and is exposed publicly via a very basic REST API.

11. CONCLUSION 

We introduced a binary encoding that supports the entire decimal value space including special numbers, that does not lose precision, and that preserves the order, in the sense that the encoding is a homomorphism between the decimals (sorted naturally) and the bit sequences (sorted fully lexicographically). This encoding is simple to explain: its specification in this paper fits on 2 pages. It is not parameterized in any way, making it implementable and understandable in a straightforward way.